Abstract

With the rapid growth of web images, hashing has received increasing interest in large scale image retrieval. Research efforts have been devoted to learning compact binary codes that preserve semantic similarity based on labels. However, most of these hashing methods are designed to handle simple binary similarity. The complex multilevel semantic structure of images associated with multiple labels has not yet been well explored. Here we propose a deep semantic ranking based method for learning hash functions that preserve multilevel semantic similarity between multi-label images. In our approach, a deep convolutional neural network is incorporated into hash functions to jointly learn feature representations and mappings from them to hash codes, which avoids the limited semantic representation power of hand-crafted features. Meanwhile, a ranking list that encodes the multilevel similarity information is employed to guide the learning of such deep hash functions. An effective scheme based on a surrogate loss is used to solve the intractable optimization problem of nonsmooth and multivariate ranking measures involved in the learning procedure. Experimental results show the superiority of our proposed approach over several state-of-the-art hashing methods in terms of ranking evaluation metrics when tested on multi-label image datasets.


1. Introduction 

Representing images efficiently is an important task for large scale content-based image retrieval. Binary hashing has attracted extensive attention due to the computational and storage efficiency of binary hash codes. It aims to map high-dimensional image data to compact binary codes in a Hamming space while maintaining some notion of similarity (e.g., metric similarity in the original feature space or semantic similarity based on labels).

Early hashing methods are data-independent, such as locality sensitive hashing [5] and its variants, which use random projections as hash functions without exploring the data distribution. Recently, various data-dependent hashing methods have been proposed, which learn hash functions according to the data distribution. Some of them mainly focus on preserving the metric structure of the data in the original feature space, such as spectral hashing [32] and binary reconstructive embedding [14]. However, distance metrics (e.g., Euclidean distance) in the original space sometimes cannot measure well the semantic similarity that is essential for image retrieval.

Figure 1. The proposed deep semantic ranking based hashing. Solid and hollow arrows indicate forward and backward propagation directions of features and gradients respectively. Hash functions consist of a deep convolutional neural network (CNN) and binary mappings of the feature representation from the top hidden layers of the CNN. Multilevel semantic ranking information is used to learn such deep hash functions to preserve the semantic structure of multi-label images.

To preserve the semantic structure of the data, hashing methods with supervisory information in the form of class labels have been further developed [24, 27, 28, 19, 17]. Through formulating hash function learning as a classification problem or as an optimization problem of pairwise relation based loss functions, these methods are able to learn hash codes which preserve binary semantic similarity. But in practice images are usually simultaneously associated with multiple semantic labels, and in this case the similarity relationship between two images is more complex and is usually relevant to the number of common labels that the images have. Consequently, a multilevel measure (such as very similar, normally similar and dissimilar) is required to describe the similarity, which cannot be handled well by the above methods and has not been studied well.

Besides, the learning capability of the standard pipeline followed by most hashing methods, i.e., first extracting features such as GIST [21] and SIFT [18] as image representations and then learning mappings from these representations to binary codes, is inadequate for dealing with relatively complex semantic structure due to the semantic information loss in the hand-crafted features. Thus a more effective semantic feature representation is also desirable.

In this paper, we introduce a novel framework based on semantic ranking and a deep learning model for learning hash functions that preserve multilevel similarity between multi-label images in the semantic space. An overall view of the proposed framework, termed deep semantic ranking based hashing (DSRH), is illustrated in Fig. 1. Here we use a deep convolutional neural network (CNN) [13] to construct hash functions that learn directly from images, which provides much richer semantic information than hand-crafted features. Meanwhile, we learn such deep hash functions with semantic ranking supervision, which is the order of a ranking list derived from shared class labels between query and database images. The learning is a joint optimization of feature representations and mappings from them to hash codes, and it is more effective than the conventional two-stage pipeline. A ranking loss defined on a set of triplets is used as a surrogate loss to solve the optimization problem resulting from nonsmooth and multivariate ranking measures, and then the stochastic gradient descent algorithm can be used to optimize model parameters. We evaluate the proposed DSRH method on two multi-label image datasets and compare it with several state-of-the-art hashing methods based on both hand-crafted features and activation features from the CNN model. Experimental results demonstrate that our method is able to capture complex multilevel semantic structure and significantly outperforms other hashing methods in ranking quality.

Our main contributions include: 1) A novel hash function learning framework is proposed that combines semantic ranking and a deep learning model to address the problem of preserving multilevel semantic similarity between multi-label images. To the best of our knowledge, this is the first time that a deep convolutional neural network with listwise ranking supervision has been exploited for hashing. 2) A scheme based on surrogate losses is applied to the proposed framework to effectively solve the optimization problem of ranking measures. 3) Our method boosts the benchmark of multi-label image retrieval, achieving state-of-the-art performance in terms of ranking evaluation metrics.


The rest of this paper is organized as follows. Related work is briefly discussed in Section 2. The proposed deep semantic ranking based hashing is formulated and optimized in Section 3. Experimental evaluations are presented in Section 4. Finally, Section 5 concludes this paper.

2. Related Work 

As described before, the existing hashing methods can be roughly divided into two categories: data-independent and data-dependent. Here we mainly discuss data-dependent hashing methods that preserve the semantic structure, which is the focus of this paper. Iterative quantization with canonical correlation analysis (CCA-ITQ) [8] utilizes CCA with labels to reduce the dimensionality of input data and binarizes the outcome through minimizing the quantization error, where only the pointwise label information is exploited to guide hash function learning. By comparison, some approaches try to preserve the semantic similarity based on pairwise relations. Boosted similarity sensitive coding (BSSC) [24] assigns each pair of data points a label to learn a set of weak classifiers as hash functions. Semi-supervised hashing (SSH) [28] minimizes an empirical error over the labeled pairs of points and makes hash codes balanced and uncorrelated to avoid overfitting. Motivated by the latent structural SVM, minimal loss hashing (MLH) [19] proposes a pairwise hinge-like loss function and minimizes its upper bound to learn similarity-preserving binary codes.

Furthermore, order-preserving approaches, which are more related to this paper, explicitly use ranking information in objective functions to learn hash codes that preserve the similarity order in the feature or semantic space. Order preserving hashing (OPH) [31] formulates an alignment between the similarity orders computed respectively from the original Euclidean space and the Hamming space, which can be solved using the quadratic penalty algorithm. On the basis of [19], Hamming distance metric learning (HDML) [20] develops a metric learning framework based on a triplet ranking loss to preserve relative similarity. However, this triplet loss function only considers local ranking information and is limited in capturing information about multilevel similarity. By using a triplet representation for listwise supervision, ranking-based supervised hashing (RSH) [29] minimizes the inconsistency of ranking order between the Hamming and original spaces to keep the global ranking order. Different from RSH, our method leverages a deep learning model to discover deeper semantic similarity and scales well to large training sets. Column generation hashing (CGH) [15] and StructHash [16] combine ranking information with the boosting framework to learn a weighted Hamming embedding. In contrast, our method needs no extra weights to rank hash codes.

Deep learning models, particularly deep convolutional neural networks (CNNs), have achieved great success in various visual tasks such as image classification, annotation, retrieval and object detection [13, 7, 4, 30, 6] due to their powerful representation learning capability. Some ranking loss based CNNs have been explored in these tasks. Wang et al. [30] use a ranking loss based on triplet sampling in CNNs to learn an image similarity metric. Gong et al. [7] incorporate a warp approximate ranking into CNNs for image annotation. There are a few hashing methods that also use deep models. Salakhutdinov et al. [23] use a deep generative model as hash functions. Similarly, Torralba et al. [27] model a deep network by using multiple layers of RBMs. Given approximate hash codes learned from pairwise similarity matrix decomposition, Xia et al. [33] learn hash functions using CNNs to fit the learned hash codes. However, these methods do not explicitly impose a ranking constraint on the deep models and therefore cannot address the multilevel similarity problem.

3. Our Method 

In general, a hash function $h: \mathbb{R}^D \to \{-1, 1\}$ is treated as a mapping that projects a $D$-dimensional input onto a binary code. Assume that we are given a set of class labels $\mathcal{L} = \{1, \ldots, C\}$ and a dataset $\mathcal{D} = \{\mathbf{x}_n\}_{n=1}^{N}$, where each data point $\mathbf{x} \in \mathbb{R}^D$ is associated with a subset of labels $\mathcal{Y} \subseteq \mathcal{L}$. Our goal is to learn a set of hash functions $\mathbf{h}(\mathbf{x}) = [h_1(\mathbf{x}), h_2(\mathbf{x}), \ldots, h_K(\mathbf{x})]$ that generates $K$-bit ($K \ll D$) binary codes while preserving the semantic structure of data points with multiple labels.

3.1. Deep Hash Functions 

A good form of hash functions is important for obtaining desirable hash codes. As mentioned earlier, most conventional hashing methods first extract visual features like GIST and SIFT from images and then learn “shallow” (usually linear) hash functions upon these features. However, these hand-crafted features have limited representation power and may lose key semantic information which is important to the task of similarity search. Here we consider designing deep hash functions using CNNs to jointly learn feature representations from raw pixels of images and their mappings to hash codes. This non-linear hierarchical hash function has more powerful learning capability than the shallow one based on features extracted in advance, and thus is able to learn feature representations more suitable for multilevel semantic similarity search.

Figure 2. The structure of deep hash functions. An input image is first transformed to a fixed size, and then goes through five convolution layers and two fully-connected layers, which provides a deep feature representation. Finally, the hash layer generates a compact binary code. The hash layer is also directly connected to the first fully-connected layer (FCa) in order to utilize diverse feature information biased toward visual appearance.

As shown in Fig. 2, we construct hash functions by incorporating the CNN model whose architecture is the same as that of [12]. A deep feature representation is computed by forward propagating a mean-subtracted 224 x 224 image through five convolutional layers and two fully connected layers, and is then fed into the last hash layer to generate a compact binary code. Please refer to [13, 12] for more details about the geometry of the convolutional layers, local normalization and max pooling. Unlike image classification, hash codes are expected to contain global feature information within an image when used for retrieval. Thus instead of cropping an image, we warp all pixels in the image to the required size. Inspired by [26], we add a bypassing connection between the first fully connected layer (FCa), called the skipping layer, and the hash layer to reduce the possible information loss. We argue that the features from the second fully connected layer (FCb) of the CNN depend too much on classes and have strong invariance, which is unfavorable for capturing subtle semantic distinctions. Thus we connect the hash layer to both fully-connected layers to enable it to encode more diverse information biased toward visual appearance. Accordingly we define a deep hash function as:

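$$h(\mathbf{x}; \mathbf{w}) = \operatorname{sign}\big(\mathbf{w}^{T}[f_a(\mathbf{x}); f_b(\mathbf{x})]\big), \tag{1}$$

where $\mathbf{w}$ denotes the weights in the hash layer, and $f_a(\cdot)$ and $f_b(\cdot)$ denote the feature vectors output by the layers FCa and FCb respectively, which can be represented as the composition of the functions of the previous layers. Here bias terms and the parameters of $f_a(\cdot)$ and $f_b(\cdot)$ are omitted for the sake of concision. To obtain a $K$-bit binary code, $\mathbf{h}(\mathbf{x}; \mathbf{W}) = [h_1(\mathbf{x}; \mathbf{w}_1), h_2(\mathbf{x}; \mathbf{w}_2), \ldots, h_K(\mathbf{x}; \mathbf{w}_K)]$ can be computed.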
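As an illustration, the hash layer of Eq. (1) can be sketched in plain Python as follows. This is only a toy sketch: the CNN layers are not modelled, the feature vectors and weights are made-up numbers, the names `sign` and `hash_code` are hypothetical, and sending zero to +1 is an arbitrary convention assumed here.

```python
def sign(t):
    """Map a real number to {-1, +1}; zero goes to +1 (assumed convention)."""
    return 1 if t >= 0 else -1


def hash_code(f_a, f_b, W):
    """K-bit code: bit k is sign(w_k . [f_a; f_b]), following Eq. (1)."""
    features = list(f_a) + list(f_b)  # concatenation [f_a(x); f_b(x)]
    return [sign(sum(w * f for w, f in zip(w_k, features))) for w_k in W]


# Toy 2-dimensional FCa and FCb activations and K = 3 weight vectors.
f_a, f_b = [0.5, -1.0], [2.0, 0.0]
W = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 1.0],
]
code = hash_code(f_a, f_b, W)
print(code)
```

Each row of `W` plays the role of one weight vector $\mathbf{w}_k$, so the length of the code equals the number of rows.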

3.2. Semantic Ranking Supervision 

When each data point in $\mathcal{D}$ is associated with a single class label, pairs of points can be labelled as either similar or dissimilar according to whether they have the same label, and the learning procedure of hash functions is simply to make the Hamming distances between binary codes small (large) for similar (dissimilar) pairs. However, in the case of multiple labels, there exists multilevel similarity between data points depending on how many common labels they have. To preserve such multilevel semantic structure, one of the most essential ways is, for each individual data point, to keep the ranking order of neighbors computed by the Hamming distance consistent with that derived from semantic labels in terms of ranking evaluation measures.

Assume we have a sample point from $\mathcal{D}$ as a query $\mathbf{q}$. For the query $\mathbf{q}$, a semantic similarity level $r$ of a database point $\mathbf{x}$ with $\mathbf{q}$ can be calculated based on the number of their common labels. The most similar database points are those sharing all the labels with $\mathbf{q}$, and they are assigned a level $r = |\mathcal{Y}_q|$. Accordingly, the second most similar points, which share any $|\mathcal{Y}_q| - 1$ of the labels, are assigned a level $r = |\mathcal{Y}_q| - 1$. Finally, the dissimilar points share none of the labels and are assigned a level $r = 0$. We can then obtain a ground-truth ranking list for $\mathbf{q}$ by sorting the database points in decreasing order of their similarity levels. According to the ground-truth ranking, various evaluation criteria can be used to measure the consistency of the rankings predicted by hash functions, such as the Normalized Discounted Cumulative Gain (NDCG) score [10], which is a popular measure in the information retrieval community and is defined as:


$$\mathrm{NDCG}@p = \frac{1}{Z} \sum_{i=1}^{p} \frac{2^{r_i} - 1}{\log(1 + i)}, \tag{2}$$

where $p$ is the truncated position in a ranking list, $Z$ is a normalization constant to ensure that the NDCG score for the correct ranking is one, and $r_i$ is the similarity level of the $i$-th database point in the ranking list. Directly optimizing such ranking criteria is intractable, since it involves minimizing nonsmooth and multivariate ranking losses. One solution is to regard it as a structured output learning problem and to optimize it with a structured SVM. But this framework is not suitable for the deep learning model that is used to construct our hash functions. Next we discuss a simple but effective scheme based on a surrogate loss.

3.3. Optimization with Surrogate Loss

To circumvent the problem of directly optimizing the ranking criteria, we use a surrogate loss as the risk that the learning procedure minimizes in practice. Given a query $\mathbf{q}$ and a ranking list $\{\mathbf{x}_i\}_{i=1}^{M}$ for $\mathbf{q}$, we can define a ranking loss on a set of triplets of hash codes as follows:

$$L\big(\mathbf{h}(\mathbf{q}), \{\mathbf{h}(\mathbf{x}_i)\}_{i=1}^{M}\big) = \sum_{i=1}^{M} \sum_{j:\, r_j < r_i} \big[\delta d_H\big(\mathbf{h}(\mathbf{q}), \mathbf{h}(\mathbf{x}_i), \mathbf{h}(\mathbf{x}_j)\big) + \rho\big]_+, \tag{3}$$

where $M$ is the length of the ranking list, $[\cdot]_+ = \max(0, \cdot)$, $\delta d_H(\mathbf{h}, \mathbf{h}_1, \mathbf{h}_2) = d_H(\mathbf{h}, \mathbf{h}_1) - d_H(\mathbf{h}, \mathbf{h}_2)$, $d_H(\cdot, \cdot)$ is the Hamming distance and $\rho$ is a margin parameter which controls the minimum margin between the distances of the two pairs. This surrogate loss is a convex upper bound on the pairwise disagreement, which counts the number of incorrectly ranked triplets. This type of surrogate loss has been used in Ranking SVM [11] for learning to rank, where a score function for ranking is learned.

From the definition of NDCG, it can be observed that the top ranked items have a larger gain factor for the score, which better reflects the performance of ranking models in practical image retrieval systems, because users usually pay most of their attention to the results on the first few pages. Thus we wish the ranking of these items to be predicted more accurately than that of the others. However, (3) treats all triplets equally, which is not desired. Inspired by [1], we modify the ranking loss by adding adaptive weights related to the similarity levels of database points:

$$L_\omega\big(\mathbf{h}(\mathbf{q}), \{\mathbf{h}(\mathbf{x}_i)\}_{i=1}^{M}\big) = \sum_{i=1}^{M} \sum_{j:\, r_j < r_i} \omega(r_i, r_j) \big[\delta d_H\big(\mathbf{h}(\mathbf{q}), \mathbf{h}(\mathbf{x}_i), \mathbf{h}(\mathbf{x}_j)\big) + \rho\big]_+. \tag{4}$$


According to NDCG, the weight $\omega$ can be given by:

$$\omega(r_i, r_j) = \frac{2^{r_i} - 2^{r_j}}{Z}, \tag{5}$$

where $Z$ is the normalization constant in (2). The more relevant $\mathbf{x}_i$ is to $\mathbf{q}$ compared with $\mathbf{x}_j$, the larger the decline the NDCG score would suffer if $\mathbf{x}_i$ is ranked behind $\mathbf{x}_j$, and thus the larger the weight that should be assigned to this triplet. When $\omega(r_i, r_j) = 1$, (4) reduces to (3).
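A minimal sketch of the surrogate losses (3) and (4) with the weight (5) is given below, written in plain Python on exact binary codes. The function names, the toy codes, the margin value and the choice $Z = 1$ are assumptions made only for illustration.

```python
def hamming(h1, h2):
    """Hamming distance between two codes with entries in {-1, +1}."""
    return sum(1 for a, b in zip(h1, h2) if a != b)


def ndcg_weight(r_i, r_j, Z):
    """Adaptive weight of Eq. (5)."""
    return (2 ** r_i - 2 ** r_j) / Z


def ranking_loss(h_q, codes, levels, rho, Z=1.0, adaptive=True):
    """Eq. (4); with adaptive=False every weight is 1, which gives Eq. (3)."""
    loss = 0.0
    for h_i, r_i in zip(codes, levels):
        for h_j, r_j in zip(codes, levels):
            if r_j < r_i:  # x_i should be ranked ahead of x_j
                w = ndcg_weight(r_i, r_j, Z) if adaptive else 1.0
                delta = hamming(h_q, h_i) - hamming(h_q, h_j)
                loss += w * max(0.0, delta + rho)
    return loss


# Toy 4-bit codes: the level-2 item is farther from the query than the
# level-1 item, so exactly one triplet violates the margin.
h_q = [1, 1, 1, 1]
codes = [[1, 1, 1, -1], [1, 1, 1, 1], [-1, -1, -1, -1]]
levels = [2, 1, 0]
plain = ranking_loss(h_q, codes, levels, rho=1, adaptive=False)
weighted = ranking_loss(h_q, codes, levels, rho=1, adaptive=True)
print(plain, weighted)
```

In this toy case the violated triplet involves levels 2 and 1, so the adaptive weight $(2^2 - 2^1)/Z$ doubles its contribution relative to the unweighted loss.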

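Given the dataset $\mathcal{D}$ as a training set, we wish to learn hash functions that optimize the rankings for all query points $\mathbf{q}$ from $\mathcal{D}$. Based on the surrogate loss (4) and the hash function (1), the objective function can be given by the empirical loss subject to some regularization:

$$\mathcal{F}(\mathbf{W}) = \sum_{\mathbf{q} \in \mathcal{D},\, \{\mathbf{x}_i\}_{i=1}^{M} \subset \mathcal{D}} L_\omega\big(\mathbf{h}(\mathbf{q}; \mathbf{W}), \{\mathbf{h}(\mathbf{x}_i; \mathbf{W})\}_{i=1}^{M}\big) + \frac{\alpha}{2} \Big\| \operatorname{mean}_{\mathbf{q}}\big(\mathbf{h}(\mathbf{q}; \mathbf{W})\big) \Big\|_2^2 + \frac{\beta}{2} \|\mathbf{W}\|_2^2. \tag{6}$$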


Similar to [20], the second term is the balance penalty, which encourages each bit, averaged over the training data, to be mean-zero and ensures more stable convergence of the learning procedure. The third term is the $\ell_2$ weight decay, which penalizes large weights [12]. Due to the discontinuous sign function in (1), the optimization of (6) is difficult. To address this issue, we relax $h(\mathbf{x}; \mathbf{w})$ to:

$$h(\mathbf{x}; \mathbf{w}) = 2\sigma\big(\mathbf{w}^{T}[f_a(\mathbf{x}); f_b(\mathbf{x})]\big) - 1, \tag{7}$$

where $\sigma(t) = 1/(1 + \exp(-t))$ is the logistic function. In order to facilitate the gradient computation, we rewrite the Hamming distance in the form of an inner product:


$$d_H\big(\mathbf{h}(\mathbf{q}; \mathbf{W}), \mathbf{h}(\mathbf{x}; \mathbf{W})\big) = \frac{K - \mathbf{h}(\mathbf{q}; \mathbf{W})^{T} \mathbf{h}(\mathbf{x}; \mathbf{W})}{2}, \tag{8}$$

where $K$ is the number of hash bits.

Figure 3. Ranking performance evaluations (NDCG, ACG and weighted mAP) of different components using various numbers of hash bits on two datasets: (a) MIRFLICKR-25K and (b) NUS-WIDE.
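The relaxation (7) and the inner-product form (8) can be checked numerically with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and the logistic function is evaluated directly, which is adequate for moderate inputs only (very large negative inputs would overflow `math.exp`).

```python
import math


def relaxed_bit(t):
    """Eq. (7): 2 * sigmoid(t) - 1, a smooth stand-in for sign(t) in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-t)) - 1.0


def hamming_inner(h1, h2):
    """Eq. (8): (K - h1 . h2) / 2, exact for codes in {-1, +1}^K."""
    K = len(h1)
    return (K - sum(a * b for a, b in zip(h1, h2))) / 2.0


def hamming_count(h1, h2):
    """Reference Hamming distance: number of differing bits."""
    return sum(1 for a, b in zip(h1, h2) if a != b)


h1, h2 = [1, -1, 1, 1], [1, 1, -1, 1]
print(hamming_inner(h1, h2), hamming_count(h1, h2))
print(relaxed_bit(0.0), relaxed_bit(10.0))
```

For exact binary codes the two distance functions agree; for relaxed codes with entries in (-1, 1) the inner-product form stays differentiable, which is what the gradient computation relies on.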

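Stochastic gradient descent is used to minimize the objective function. It can be observed that the loss function (4) is actually a summation of a sequence of weighted triplet losses. For any one triplet $(\mathbf{q}, \mathbf{x}_i, \mathbf{x}_j)$, if

$$\frac{1}{2}\big(\mathbf{h}(\mathbf{q}; \mathbf{W})^{T} \mathbf{h}(\mathbf{x}_j; \mathbf{W}) - \mathbf{h}(\mathbf{q}; \mathbf{W})^{T} \mathbf{h}(\mathbf{x}_i; \mathbf{W})\big) + \rho > 0,$$

the derivatives of (6) with respect to the hash code vectors are given by:

$$\frac{\partial \mathcal{F}}{\partial \mathbf{h}(\mathbf{q}; \mathbf{W})} = \frac{\alpha}{N_q} \operatorname{mean}_{\mathbf{q}}\big(\mathbf{h}(\mathbf{q}; \mathbf{W})\big) + \frac{1}{2}\, \omega(r_i, r_j) \big(\mathbf{h}(\mathbf{x}_j; \mathbf{W}) - \mathbf{h}(\mathbf{x}_i; \mathbf{W})\big), \tag{9}$$

$$\frac{\partial \mathcal{F}}{\partial \mathbf{h}(\mathbf{x}_i; \mathbf{W})} = -\frac{1}{2}\, \omega(r_i, r_j)\, \mathbf{h}(\mathbf{q}; \mathbf{W}), \tag{10}$$

$$\frac{\partial \mathcal{F}}{\partial \mathbf{h}(\mathbf{x}_j; \mathbf{W})} = \frac{1}{2}\, \omega(r_i, r_j)\, \mathbf{h}(\mathbf{q}; \mathbf{W}), \tag{11}$$

where the mean value is computed over one mini-batch and $N_q$ is the size of a mini-batch. These derivative values can be fed into the underlying CNN via the back-propagation algorithm to update the parameters of each layer.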
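The ranking-loss parts of the derivatives (9)-(11) can be sanity-checked against finite differences. The sketch below does this for a single active triplet with relaxed, real-valued codes; the balance penalty and the weight decay are deliberately left out, and all names and numbers are illustrative assumptions.

```python
def dot(a, b):
    return sum(s * t for s, t in zip(a, b))


def triplet_term(q, x_i, x_j, w, rho):
    """One weighted hinge term of Eq. (4), using the inner-product form (8)."""
    return w * max(0.0, 0.5 * (dot(q, x_j) - dot(q, x_i)) + rho)


def analytic_grads(q, x_i, x_j, w):
    """Ranking-loss parts of Eqs. (9)-(11); valid while the hinge is active."""
    g_q = [0.5 * w * (b - a) for a, b in zip(x_i, x_j)]
    g_i = [-0.5 * w * c for c in q]
    g_j = [0.5 * w * c for c in q]
    return g_q, g_i, g_j


def numeric_grad(f, v, eps=1e-6):
    """Central finite differences of f at the point v."""
    grad = []
    for k in range(len(v)):
        up = v[:k] + [v[k] + eps] + v[k + 1:]
        down = v[:k] + [v[k] - eps] + v[k + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


q, x_i, x_j = [0.9, -0.8, 0.7], [0.2, 0.1, -0.3], [0.8, -0.9, 0.6]
w, rho = 2.0, 1.0
g_q, g_i, g_j = analytic_grads(q, x_i, x_j, w)
n_q = numeric_grad(lambda v: triplet_term(v, x_i, x_j, w, rho), q)
n_i = numeric_grad(lambda v: triplet_term(q, v, x_j, w, rho), x_i)
n_j = numeric_grad(lambda v: triplet_term(q, x_i, v, w, rho), x_j)
max_error = max(abs(a - n) for a, n in zip(g_q + g_i + g_j, n_q + n_i + n_j))
print(max_error)
```

Because the hinge is linear in each code vector while it is active, the finite-difference estimate should agree with the analytic expressions up to floating-point error.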


4. Experiments 

We test the proposed hashing method on two multi-label benchmark datasets, i.e., MIRFLICKR-25K [9] and NUS-WIDE [2]. We present quantitative evaluations in terms of ranking measures and compare our method with unsupervised methods, namely iterative quantization (ITQ) [8] and spectral hashing (SH) [32], and with supervised methods using multi-label and ranking information respectively, namely CCA-ITQ [8] and Hamming distance metric learning (HDML) [20].

We set the mini-batch size for gradient descent to 128, and apply dropout with a keep probability of 0.5 to the fully connected layers to avoid overfitting. The regularization parameters $\alpha$ and $\beta$ in the objective function (6) are set to 1 and [...] respectively. The length of the ground-truth ranking list used for training is set to 3; such a list can be created by taking one item sharing all the labels with a query, one item without any common label and one item having at least one common label. For all compared methods, we use the best settings reported in the corresponding papers.
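The construction of a length-3 training ranking list can be sketched as follows. The sampling procedure itself is not specified in detail above, so this is one possible reading: the similarity level is the number of shared labels, "at least one common label" is interpreted here as sharing some but not all of the query's labels so that the three levels are distinct, and all names and the toy database are hypothetical.

```python
import random


def similarity_level(query_labels, item_labels):
    """Number of labels shared by the query and a database item."""
    return len(set(query_labels) & set(item_labels))


def sample_ranking_list(query_labels, database, rng):
    """Pick one fully similar, one partially similar and one dissimilar item.

    `database` is a list of (item_id, labels) pairs. Returns (item_id, level)
    pairs in decreasing order of level, or None if any bucket is empty.
    """
    top = len(set(query_labels))
    full, partial, none = [], [], []
    for item_id, labels in database:
        r = similarity_level(query_labels, labels)
        if r == 0:
            none.append((item_id, r))
        elif r == top:
            full.append((item_id, r))
        else:
            partial.append((item_id, r))
    if not (full and partial and none):
        return None
    return [rng.choice(full), rng.choice(partial), rng.choice(none)]


database = [("a", {"sky", "tree"}), ("b", {"sky"}), ("c", {"food"})]
ranking = sample_ranking_list({"sky", "tree"}, database, random.Random(0))
print(ranking)
```

Under this reading a query with a single label has no partially similar items, so the sketch returns `None` for it; how such queries are handled in practice is not stated in the text.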

The ImageNet ILSVRC-2012 dataset [22] is utilized to pre-train the CNN model by optimizing the multinomial logistic regression objective function in the image classification task. This dataset contains about 1.2 million training images and 50,000 validation images, roughly 1000 images in each of 1000 categories. We use the pre-trained parameters of the convolutional layers and fully-connected layers to initialize the CNN part of the hash functions in our method. In order to ensure fairness, we apply the features from the pre-trained CNN model to the compared hashing methods as well. Furthermore, the compared methods are also run on the features that are learned by fine-tuning the CNN model on the multi-label datasets in our retrieval task.

Figure 4. Comparison of ranking performance of our DSRH and other hashing methods based on hand-crafted features on two datasets: (a) MIRFLICKR-25K and (b) NUS-WIDE.

4.1. Datasets 

The MIRFLICKR-25K dataset [9] consists of 25,000 images collected from the social photography website Flickr. All images are annotated for 24 semantic concepts including various scene and object categories such as sky, night, food and tree. Moreover, 14 of these concepts are used for a stricter labeling, i.e., an image is annotated with a concept again only if the concept is salient. Thus we have a total of 38 semantic labels, where each image may belong to several labels. 2000 images are randomly selected as testing queries and the remaining images are used as the database for training and retrieval. Following [25], a 3857-dimensional feature vector for each image is extracted by concatenating Pyramid Histogram of Words (PHOW) features, Gist and MPEG-7 descriptors, which will be used for the compared methods.

The NUS-WIDE dataset [2] is a relatively larger image dataset containing 269,648 images annotated with 81 concepts. It is also from Flickr, but more challenging than MIRFLICKR-25K due to more images and more diverse semantic concepts. Since the website of the dataset does not provide raw image data but only the URLs of images, we download these images ourselves from the web for learning deep feature representations. However, we collected only 226,265 images, as some of those URLs are no longer valid. We randomly sample 5000 images as testing queries and the rest are used for training and retrieval. The dataset includes six types of low-level features extracted from these images: bag of words based on SIFT features, color histogram, color correlogram, edge direction histogram, wavelet texture and block-wise color moments. We concatenate them all and get a 1134-dimensional feature representation for each image.

4.2. Evaluation Criteria 

In our experiments, Normalized Discounted Cumulative Gain (NDCG) [10], Average Cumulative Gain (ACG) [10] and weighted mean Average Precision (mAP) are used to measure the ranking quality of retrieved database points. As mentioned before, NDCG defined in (2) evaluates the ranking of data points by penalizing errors in higher ranked items more strongly.
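For concreteness, NDCG@p as defined in (2) can be computed as in the following sketch. Here the normalization constant $Z$ is taken to be the discounted gain of the ideal ordering of the same list, which is what makes the score of a correct ranking equal to one; the natural logarithm is used, but the base cancels in the ratio. The function names are hypothetical.

```python
import math


def dcg_at_p(levels, p):
    """Unnormalized sum in Eq. (2) over the top-p similarity levels."""
    return sum(
        (2 ** r - 1) / math.log(1 + i)
        for i, r in enumerate(levels[:p], start=1)
    )


def ndcg_at_p(levels, p):
    """NDCG@p, with Z computed from the ideal (sorted) ordering."""
    Z = dcg_at_p(sorted(levels, reverse=True), p)
    return dcg_at_p(levels, p) / Z if Z > 0 else 0.0


# `levels` lists the similarity level of each retrieved item, in the order
# induced by the Hamming distance to the query.
print(ndcg_at_p([2, 1, 0], 3))   # correct ranking
print(ndcg_at_p([0, 1, 2], 3))   # reversed ranking
```

Returning 0 when no retrieved item is relevant is a convention assumed in this sketch, not something stated in the text.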

Figure 5. Comparison of ranking performance of our DSRH and other hashing methods based on activation features of pre-trained CNN on two datasets: (a) MIRFLICKR-25K and (b) NUS-WIDE.

ACG is calculated by taking the average of the similarity levels of data points within the top-$p$ positions:

$$\mathrm{ACG}@p = \frac{1}{p} \sum_{i=1}^{p} r_i, \tag{12}$$

where $r_i$ is the similarity level of the data point at the $i$-th position of a ranking. ACG is equivalent to the precision weighted by the similarity level of each data point.
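Equation (12) translates directly into code. The short sketch below uses a hypothetical function name and a made-up list of similarity levels.

```python
def acg_at_p(levels, p):
    """Eq. (12): mean similarity level over the top-p positions."""
    return sum(levels[:p]) / p


levels = [2, 1, 0, 1]  # similarity levels in retrieved order
print(acg_at_p(levels, 2), acg_at_p(levels, 4))
```

With binary levels (0 or 1) this reduces to the usual precision at $p$.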

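mAP is the mean of the average precision over all queries. Since the similarity level $r$ can be bigger than one, we compute a weighted mAP using ACG:

$$\mathrm{mAP}_w = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AP}_w(q), \tag{13}$$

$$\mathrm{AP}_w = \frac{\sum_{p=1}^{M} \mathbb{I}(r_p > 0)\, \mathrm{ACG}@p}{M_{r>0}}, \tag{14}$$

where $Q$ denotes the number of queries, $\mathbb{I}(\cdot) \in \{0, 1\}$ is an indicator function and $M_{r>0}$ is the number of relevant data points.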
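A small sketch of (13) and (14) follows. It accumulates ACG@p at every position that holds a relevant item and divides by the number of relevant items; the function names are hypothetical, and returning 0 for a query without relevant items is an assumption of this sketch.

```python
def weighted_ap(levels):
    """Eq. (14): ACG@p summed over positions with r_p > 0, over M_{r>0}."""
    num_relevant = sum(1 for r in levels if r > 0)
    if num_relevant == 0:
        return 0.0
    total, cumulative = 0.0, 0
    for p, r in enumerate(levels, start=1):
        cumulative += r          # running sum of levels, so ACG@p = cumulative / p
        if r > 0:
            total += cumulative / p
    return total / num_relevant


def weighted_map(rankings):
    """Eq. (13): mean of the weighted AP over all queries."""
    return sum(weighted_ap(levels) for levels in rankings) / len(rankings)


rankings = [[2, 0, 1], [0, 1]]  # one list of similarity levels per query
print(weighted_ap(rankings[0]), weighted_ap(rankings[1]), weighted_map(rankings))
```

With binary levels, `weighted_ap` coincides with the standard average precision.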


4.3. Evaluation of Different Components 


To analyze the effectiveness of several important components in the proposed DSRH, we remove the connection between the hash layer and the skipping layer, and additionally set $\omega(r_i, r_j) = 1$ in (4), to evaluate their influence on the final performance. These two models are called DSRH-NS and DSRH-NS-NW respectively. Here NS denotes no skipping layer and NW denotes no adaptive weight. Fig. 3 shows the results in terms of the ranking measures.

We can see that using the surrogate loss with adaptive weights can improve the ranking quality of the top-100 relevant items in terms of NDCG and ACG at the expense of the averaged ranking performance, because it assigns larger weights to more relevant database points and weakens the effect of the less relevant ones. Connecting the first fully-connected layer to the hash layer can also improve the performance, because more information biased toward visual appearance can be utilized, which may be important for capturing multilevel semantic similarity.

4.4. Method Comparison 

We also compare the proposed DSRH with other hashing methods based on hand-crafted features. Fig. 4 illustrates the NDCG, ACG and weighted mAP scores of these methods using various numbers of bits. We can see that the performance of our method is significantly better than that of the other methods based on hand-crafted features in all cases. By using the CNN model to construct hash functions, our method has higher learning capability and is able to exploit more semantic information than the hashing methods trained on hand-crafted features, which are usually extracted by unsupervised and shallow models.

We further evaluate the compared hashing methods on the features obtained from the activation of the last hidden layer of the CNN model pre-trained on the ImageNet dataset. This activation feature can be seen as a generic visual feature and has been used in object recognition, domain adaptation and scene recognition [3]. The results of NDCG, ACG and weighted mAP are shown in Fig. 5. Although the activation features boost the ranking performance of the compared methods by a large margin, our method still has better performance because we construct the deep hash function to jointly learn feature representations and hash codes, which can utilize semantic supervision information to obtain features better fitted to the retrieval datasets. It is more effective than learning hash codes from features learned in advance.

Figure 6. Comparison of ranking performance of our DSRH and other hashing methods based on activation features of fine-tuned CNN on two datasets: (a) MIRFLICKR-25K and (b) NUS-WIDE.

To further verify the superiority of our method, we use the activation features fine-tuned on the retrieval datasets, i.e., MIRFLICKR-25K and NUS-WIDE, to evaluate the compared methods as well. Specifically, we retrain the CNN model for multi-label image classification using a cross-entropy cost function. We report the results in Fig. 6. It can be seen that our method still achieves the best ranking performance, which shows that the multilevel semantic ranking supervision can make hash function learning better preserve the semantic structure of multi-label images. CCA-ITQ also uses multi-label information, but does not explicitly learn with the ranking. HDML performs even worse than the unsupervised ITQ because it only considers a binary similarity relationship, which harms the semantic structure in the fine-tuned features. Note that using the features fine-tuned by multi-label supervision, the unsupervised ITQ performs almost as well as CCA-ITQ, which is its supervised version, and even better in terms of weighted mAP.

Similar to the skipping layer in the hash function of DSRH, we also attempt to concatenate the activations of the last two hidden layers of the CNN model as feature representations and apply them to the compared hashing methods. However, the performance of the compared methods trained using these features becomes even worse. This further validates the effectiveness of the structure of our hash function, which has a tight coupling with the CNN.

5. Conclusion 

In this paper we have proposed to employ multilevel semantic ranking supervision to learn deep hash functions based on a CNN that preserve the semantic structure of multi-label images. The CNN model with listwise ranking supervision is used to jointly learn feature representations and mappings from them to binary codes. The resulting optimization problem of nonsmooth and multivariate ranking measures is solved by using a ranking loss on a set of triplets as the surrogate loss, which allows stochastic gradient descent to be used to optimize model parameters effectively. Extensive experiments demonstrate that the proposed method outperforms other state-of-the-art hashing methods in terms of ranking quality.